In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10

Automatic alignment on labels



In [2]:

    
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')
countries









    Out[2]:






  
    
      
      area
      capital
      population
    
    
      country
      
      
      
    
  
  
    
      Belgium
      30510
      Brussels
      11.3
    
    
      France
      671308
      Paris
      64.3
    
    
      Germany
      357050
      Berlin
      81.3
    
    
      Netherlands
      41526
      Amsterdam
      16.9
    
    
      United Kingdom
      244820
      London
      64.9

`Series` alignment

Let's define a table with natural increase rates in 2013 (data from World Bank):



In [3]:

    
death_rate = pd.Series([10, 9, 11, 8, 9],
                       index=['Poland','United Kingdom', 'Germany', 'Netherlands', 'France'])
print(death_rate)









    



Poland            10
United Kingdom     9
Germany           11
Netherlands        8
France             9
dtype: int64



In [4]:

    
birth_rate = pd.Series([10, 9, 10, 12],
                       index=['Netherlands',  'Germany', 'Poland', 'France'])
print(birth_rate)









    



Netherlands    10
Germany         9
Poland         10
France         12
dtype: int64

Now we calculate the natural increae by subtracting death rate from birth rate:



In [5]:

    
natural_increase = birth_rate - death_rate
print(natural_increase)









    



France             3
Germany           -2
Netherlands        2
Poland             0
United Kingdom   NaN
dtype: float64

Note that the rows where the two series did not overlap contain missing values (NaN = Not a Number) and that the data were properly aligned on the index.



In [6]:

    
pop_change = pd.DataFrame({'death rate' : death_rate, 
                           'birth rate' : birth_rate,
                           'natural increase' : natural_increase})

Missing values

We can remove the missing data using dropna method:



In [7]:

    
pop_change.dropna(inplace=True)



In [8]:

    
pop_change









    Out[8]:






  
    
      
      birth rate
      death rate
      natural increase
    
  
  
    
      France
      12
      9
      3
    
    
      Germany
      9
      11
      -2
    
    
      Netherlands
      10
      8
      2
    
    
      Poland
      10
      10
      0

EXERCISE: Calculate estimated population in 2014 by summing the population and natural increase (remember that the natural increase is given per 1000 people).



In [ ]:

EXERCISE: Calculate ratio of the highest and lowest estimated population in 2014.



In [ ]:

Joining two data frames

Let's now try to add the data to the country data:



In [11]:

    
countries.join(pop_change)









    Out[11]:






  
    
      
      area
      capital
      population
      birth rate
      death rate
      natural increase
    
    
      country
      
      
      
      
      
      
    
  
  
    
      Belgium
      30510
      Brussels
      11.3
      NaN
      NaN
      NaN
    
    
      France
      671308
      Paris
      64.3
      12
      9
      3
    
    
      Germany
      357050
      Berlin
      81.3
      9
      11
      -2
    
    
      Netherlands
      41526
      Amsterdam
      16.9
      10
      8
      2
    
    
      United Kingdom
      244820
      London
      64.9
      NaN
      NaN
      NaN

There are four different ways we can handle the missing rows:

left (default) — take all the rows of the left data frame
right — take all the rows of the right data frame
inner — take the common rows of both data frames
outer — take all the rows present in either or both data frames

Note that the methods are similar to SQL JOIN clause.



In [12]:

    
countries.join(pop_change, how='right')









    Out[12]:






  
    
      
      area
      capital
      population
      birth rate
      death rate
      natural increase
    
  
  
    
      France
      671308
      Paris
      64.3
      12
      9
      3
    
    
      Germany
      357050
      Berlin
      81.3
      9
      11
      -2
    
    
      Netherlands
      41526
      Amsterdam
      16.9
      10
      8
      2
    
    
      Poland
      NaN
      NaN
      NaN
      10
      10
      0

EXERCISE: Try inner and outer join. What's the difference?



In [ ]:

Groupby operations

Some 'theory': the groupby operation (split-apply-combine)

By "group by" we are referring to a process involving one or more of the following steps

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

Similar to SQL GROUP BY

The example of the image in pandas syntax:



In [14]:

    
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df



In [15]:

    
df.groupby('key').aggregate('sum')  # np.sum



In [16]:

    
df.groupby('key').sum()

You can also simply count members of each split:



In [17]:

    
df.groupby('key').size()









    Out[17]:





key
A    3
B    3
C    3
dtype: int64

Movie database

These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.



In [18]:

    
cast = pd.read_csv('data/cast.csv')
cast[10:15]









    Out[18]:






  
    
      
      title
      year
      name
      type
      character
      n
    
  
  
    
      10
      Mixing Nia
      1998
      Michael 'babeepower' Viera
      actor
      Rapper
      NaN
    
    
      11
      The Replacements
      2000
      Steven 'Bear'Boyd
      actor
      Defensive Tackle - Washington Sentinels
      NaN
    
    
      12
      Dysfunktion
      2016
      Kirlew 'bliss' Vilbon
      actor
      Bliss
      NaN
    
    
      13
      My Song for You
      2010
      George 'Bootsy' Thomas
      actor
      Cooley's Customer
      16
    
    
      14
      My Song for You
      2010
      George 'Bootsy' Thomas
      actor
      Celebration Guest
      16



In [19]:

    
titles = pd.read_csv('data/titles.csv')
titles.head()









    Out[19]:






  
    
      
      title
      year
    
  
  
    
      0
      The Rising Son
      1990
    
    
      1
      Ashes of Kukulcan
      2016
    
    
      2
      The Thousand Plane Raid
      1969
    
    
      3
      Crucea de piatra
      1993
    
    
      4
      The 86
      2015

EXERCISE: Using groupby(), plot the number of films that have been released each year in the history of cinema.



In [ ]:

EXERCISE: Use groupby() to determine how many roles are listed for each of The Pink Panther movies.



In [ ]:

Custom grouping criteria

You can also group by the values on another array under the condition that this array has the length equal to the number of rows:



In [22]:

    
greek = ['α', 'β', 'β', 'β', 'β', 'α', 'β','α', 'α']
df.groupby(greek).max()

The values for the grouping array can be also computed from values in the data frame. For example, to count odd and even number in the data column we could simply:



In [23]:

    
df.groupby(df['data'] % 2).size()









    Out[23]:





data
0    5
1    4
dtype: int64

EXERCISE: Using groupby(), plot the number of films that have been released each **decade** in the history of cinema.



In [ ]:

EXERCISE: Use groupby() to plot the number of "Hamlet" films made each decade.



In [ ]:

Multiple groups

Note that you can also groupby on multiple keys:



In [26]:

    
df['type'] = np.where(df['data'] % 2, 'odd', 'even')
print(df)









    



   data key  type
0     0   A  even
1     5   B   odd
2    10   C  even
3     5   A   odd
4    10   B  even
5    15   C   odd
6    10   A  even
7    15   B   odd
8    20   C  even



In [27]:

    
df.groupby(['type', 'key']).sum()



In [28]:

    
df['type'] = np.where(df['data'] % 2, 'odd', 'even')
print(df)
df['data']









    



   data key  type
0     0   A  even
1     5   B   odd
2    10   C  even
3     5   A   odd
4    10   B  even
5    15   C   odd
6    10   A  even
7    15   B   odd
8    20   C  even






    Out[28]:





0     0
1     5
2    10
3     5
4    10
5    15
6    10
7    15
8    20
Name: data, dtype: int64

Note that it creates a hierarchical index. More on that later.

EXERCISE: List each of the characters that Frank Oz has portrayed at least twice.



In [ ]:

EXERCISE: List, in order by year, each of the films in which Frank Oz has played more than 1 role.



In [ ]:

EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?



In [ ]:

Value counts

A useful shortcut to calculate the number of occurences of certain values is value_counts (this is somewhat equivalent to df.groupby(key).size()))

For example, what are the most occuring movie titles?



In [32]:

    
titles.title.value_counts().head()









    Out[32]:





Hamlet                  19
Carmen                  14
Macbeth                 14
The Three Musketeers    12
Maya                    11
Name: title, dtype: int64

EXERCISE: What are the 11 most common character names in movie history?



In [ ]:

Custom aggregate functions

Aggregate function could be any function accepting the Series object.

For example, let's calculate most frequent apperances in each year of last decade:



In [34]:

    
def most_frequent(x):
    return x.value_counts().index[0]



In [35]:

    
cast.loc[(cast['year'] >= 2010) & (cast['year'] < 2020), ['year', 'name']].groupby('year').agg(most_frequent)









    Out[35]:






  
    
      
      name
    
    
      year
      
    
  
  
    
      2010
      Lloyd Kaufman
    
    
      2011
      Kyle Rea
    
    
      2012
      Stefan Kramer
    
    
      2013
      Eric Roberts
    
    
      2014
      Omer Pasha
    
    
      2015
      Eric Roberts
    
    
      2016
      Bill Oberst Jr.
    
    
      2017
      David (II) Gattis
    
    
      2018
      Dana (II) Waters
    
    
      2019
      Prashast Singh

Extra exercises

EXERCISE: Which actors or actresses appeared in the most movies in the year 2010?

EXERCISE: Plot how many roles Brad Pitt has played in each year of his career.

EXERCISE: What are the 10 most film titles roles that start with the word "The Life"?

EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in 2000s?

Acknowledgement

© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).

© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)

Licensed under CC BY 4.0 Creative Commons



In [ ]:

	area	capital	population
country
Belgium	30510	Brussels	11.3
France	671308	Paris	64.3
Germany	357050	Berlin	81.3
Netherlands	41526	Amsterdam	16.9
United Kingdom	244820	London	64.9

	title	year	name	type	character	n
10	Mixing Nia	1998	Michael 'babeepower' Viera	actor	Rapper	NaN
11	The Replacements	2000	Steven 'Bear'Boyd	actor	Defensive Tackle - Washington Sentinels	NaN
12	Dysfunktion	2016	Kirlew 'bliss' Vilbon	actor	Bliss	NaN
13	My Song for You	2010	George 'Bootsy' Thomas	actor	Cooley's Customer	16
14	My Song for You	2010	George 'Bootsy' Thomas	actor	Celebration Guest	16

	title	year
0	The Rising Son	1990
1	Ashes of Kukulcan	2016
2	The Thousand Plane Raid	1969
3	Crucea de piatra	1993
4	The 86	2015

	name
year
2010	Lloyd Kaufman
2011	Kyle Rea
2012	Stefan Kramer
2013	Eric Roberts
2014	Omer Pasha
2015	Eric Roberts
2016	Bill Oberst Jr.
2017	David (II) Gattis
2018	Dana (II) Waters
2019	Prashast Singh